myblog - Using SQL and Plotly to Perform Analysis on Temperature Data

Part 1: Constructing a SQL Database

In this blog, we’ll be using SQL and plotly to construct aesthetic graphs. The sqlite3 package allows for creating and accessing SQL databases in Python.

First, some basic imports:

import pandas as pd
import sqlite3

You should already be fairly familiar with working with Pandas dataframes. To work with SQL, first we need to establish a connection and create a cursor that can execute queries for us.

conn = sqlite3.connect("data.db")
cursor = conn.cursor()

For this post we’ll be using temperature data collected from various temperature reading stations around the world. The to_sql() function imports a given dataframe into our SQL database.

def to_db(url,name):
    '''
    Reads given .csv file into the established SQL database with given name
    '''
    df = pd.read_csv(url)
    df.to_sql(name,conn,index=False,if_exists="replace")
    
to_db("temps_stacked.csv","temperatures")
to_db("countries.csv","countries")
to_db("station-metadata.csv","stations")

Now we can check that we have three different tables in our database:

cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
cursor.fetchall()

[('temperatures',), ('countries',), ('stations',)]

Part 2: Extracting Data

Our end goal is to create a graph that addresses the following question:

How does the average yearly change in temperature vary within a given country?

To this end, we need a function to extract the specific data that we want from our database. The key here is knowing how to construct the appropriate SQL command. A tutorial on basic SQL syntax can be found here: https://www.w3schools.com/sql/default.asp

For our function, we want to be able to specify a year range as well as a month, the latter so that temperature variations as seasons change don’t affect the data.

def query_climate_database(country,year_begin,year_end,month):
    '''
    Returns Pandas dataframe of data from SQL database
    where station is located in country and reading is
    from month and in range year_begin to year_end
    '''
    
    #extract FIPS 10-4 from country
    cursor.execute(f"SELECT [FIPS 10-4] FROM countries WHERE Name='{country}' LIMIT 1")
    fips = cursor.fetchone()[0]
    
    #query SQL database for data
    cmd = f"SELECT S.name,S.latitude,S.longitude,C.Name,T.temp,T.year,T.month \
    FROM temperatures T \
    LEFT JOIN stations S \
    ON T.id = S.id \
    LEFT JOIN countries C \
    ON C.[FIPS 10-4] = SUBSTRING(T.id,1,2) \
    WHERE (year >= {year_begin}) \
    AND (year <= {year_end}) \
    AND (month == {month}) \
    AND (SUBSTRING(T.id,1,2) == '{fips}')"
    
    return pd.read_sql_query(cmd,conn)

Here’s an example of what our returned dataframe looks like:

df=query_climate_database(country = "India", 
                       year_begin = 1980, 
                       year_end = 2020,
                       month = 1)
df

	NAME	LATITUDE	LONGITUDE	Name	Temp	Year	Month
0	PBO_ANANTAPUR	14.583	77.633	India	23.48	1980	1
1	PBO_ANANTAPUR	14.583	77.633	India	24.57	1981	1
2	PBO_ANANTAPUR	14.583	77.633	India	24.19	1982	1
3	PBO_ANANTAPUR	14.583	77.633	India	23.51	1983	1
4	PBO_ANANTAPUR	14.583	77.633	India	24.81	1984	1
...	...	...	...	...	...	...	...
3147	DARJEELING	27.050	88.270	India	5.10	1983	1
3148	DARJEELING	27.050	88.270	India	6.90	1986	1
3149	DARJEELING	27.050	88.270	India	8.10	1994	1
3150	DARJEELING	27.050	88.270	India	5.60	1995	1
3151	DARJEELING	27.050	88.270	India	5.70	1997	1

3152 rows × 7 columns

Part 3: Plotting

Now we’re ready to tackle our question. To calculate the average yearly change in temperature, we’ll be creating a linear regression of temperature vs. year and extracting the linear coefficient. For this to be meaningful, our function will be taking in a min_obs parameter for the minimum amount of observations required for a station to be considered. If a station doesn’t have enough observations, we remove it from the dataframe before plotting.

Before we implement this in a function, let’s work through it with our India data. We can use the groupby() function along with the transform() function which, in contrast to apply(), retains the indexing of the original dataframe.

#removes observations from stations with less than 10 observations
df[df.groupby("NAME")["Year"].transform(lambda x:x.count())>=10]

	NAME	LATITUDE	LONGITUDE	Name	Temp	Year	Month
0	PBO_ANANTAPUR	14.583	77.633	India	23.48	1980	1
1	PBO_ANANTAPUR	14.583	77.633	India	24.57	1981	1
2	PBO_ANANTAPUR	14.583	77.633	India	24.19	1982	1
3	PBO_ANANTAPUR	14.583	77.633	India	23.51	1983	1
4	PBO_ANANTAPUR	14.583	77.633	India	24.81	1984	1
...	...	...	...	...	...	...	...
3140	SHILONG	25.600	91.890	India	10.40	1986	1
3141	SHILONG	25.600	91.890	India	11.20	1990	1
3142	SHILONG	25.600	91.890	India	11.99	2010	1
3143	SHILONG	25.600	91.890	India	9.93	2011	1
3144	SHILONG	25.600	91.890	India	9.68	2012	1

3106 rows × 7 columns

Notice that we’ve removed data from the DARJEELING station, which only had 7 observations.

Next, we import from sklearn, which you should already be familiar with, and write a function that finds the average yearly temperature increase using a linear regression model. We can then use this to add a column to the dataframe.

from sklearn.linear_model import LinearRegression

def lin_coeff(data):
    '''
    Returns linear coefficient of year vs. temp regression fitted to data
    '''
    reg = LinearRegression().fit(data[["Year"]],data["Temp"])
    return round(reg.coef_[0],4)

data=df.groupby(["NAME","LATITUDE","LONGITUDE"]).apply(lin_coeff).reset_index()
data

	NAME	LATITUDE	LONGITUDE	0
0	AGARTALA	23.8830	91.2500	-0.0062
1	AGRA	27.1667	78.0333	-0.0954
2	AHMADABAD	23.0670	72.6330	0.0067
3	AKOLA	20.7000	77.0330	-0.0018
4	AKOLA	20.7000	77.0670	-0.0059
...	...	...	...	...
101	TRIVANDRUM	8.5000	77.0000	0.0229
102	UDAIPUR_DABOK	24.6170	73.8830	0.0724
103	VARANASI_BABATPUR	25.4500	82.8670	-0.0130
104	VERAVAL	20.9000	70.3670	0.0248
105	VISHAKHAPATNAM	17.7170	83.2330	-0.0340

106 rows × 4 columns

Things are looking pretty good! It’s not ideal that our column title is just “0”, but at this point we’re ready to write our plotting function using what we’ve just figured out. We’ll be relying on the plotly.express module to do so.

from plotly import express as px
import plotly.io as pio 

pio.renderers.default= 'iframe'

month_dict = {1 : "January",
              2 : "February",
              3 : "March",
              4 : "April",
              5 : "May",
              6 : "June",
              7 : "July",
              8 : "August",
              9 : "September",
              10 : "October",
              11 : "November",
              12 : "December"}

def temperature_coefficient_plot(country,year_begin,year_end,month,min_obs,**kwargs):
    '''
    Plots average yearly increase in temperature at stations in given country
    
    Parameters:
    country - country from which station data is taken
    year_begin - earliest year from which data is taken
    year_end - latest year from which data is taken
    month - month from which data is taken (prevents seasonal changes in temperature
                                            from influencing data)
    min_obs - minimum number of observations required per station
    '''
    df = query_climate_database(country,year_begin,year_end,month)
    
    #remove station data without enough observations
    df = df[df.groupby("NAME")["Year"].transform(lambda x:x.count()) >= min_obs]
    
    #calculate average yearly increase at each station
    df = df.groupby(["NAME","LATITUDE","LONGITUDE"]).apply(lin_coeff).reset_index()
    df = df.rename(columns={0:"Estimated Yearly Increase (\u00b0C)"})
    
    #creates scatterplot with color corresponding to change in temperature
    fig = px.scatter_mapbox(df,
                            lat = "LATITUDE",
                            lon = "LONGITUDE",
                            hover_name = "NAME",
                            color = "Estimated Yearly Increase (\u00b0C)",
                            color_continuous_midpoint = 0,
                            title = f"Average yearly increase in temperature in {month_dict[month]}<br>for stations in {country} from {year_begin} - {year_end}",
                            **kwargs)
    return fig

Here are two examples of what the graph looks like:

#set color scale for points
color_map = px.colors.diverging.RdGy_r

fig = temperature_coefficient_plot("India", 1980, 2020, 1, 
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map)
fig.show()

fig = temperature_coefficient_plot("China", 1975, 2015, 8, 
                                   min_obs = 10,
                                   zoom = 2,
                                   mapbox_style="carto-positron",
                                   color_continuous_scale=color_map)
fig.show()

Part 4: More Fun Stuff

Now that we’re more familiar with SQL and plotly, let’s ask some more questions we can use our tools to visualize. For example,

How does longitude affect how greatly temperature varies across time?

The process is more or less the same as before: we first define a function that uses a SQL query to return the data that we want, and then we write a plotter function that uses plotly.express.

def query_lon_temp(lat_min,lat_max):
    '''
    Returns dataframe containing longitudinal and temperature data within given latitude range
    '''
    cmd = f"SELECT S.longitude,T.temp,T.year,T.month \
    FROM temperatures T \
    LEFT JOIN stations S \
    ON T.id = S.id \
    WHERE (S.latitude >= {lat_min}) AND (S.latitude <= {lat_max})"
    
    return pd.read_sql_query(cmd,conn)

def scatter_temp_lon_all(lat_min,lat_max):
    '''
    Plots standard deviation of temperature data at each longitude for stations within
    given latitude range
    '''
    df = query_lon_temp(lat_min,lat_max)
    
    #calculate standard deviation of temperature data at each longitude
    df = df.groupby("LONGITUDE").apply(lambda x:x["Temp"].std()).reset_index()
    df = df.rename(columns={0:"Temp StdDev (\u00b0C)"})
    
    fig = px.scatter(data_frame = df, 
                 x = "LONGITUDE", 
                 y = "Temp StdDev (\u00b0C)",
                 title = f"Temperature Variation vs. Longitude at Latitudes {lat_min}\u00b0 to {lat_max}\u00b0",
                 opacity = 0.5)
    return fig

scatter_temp_lon_all(-10,10)

We see the most variation around the tropics, in the middle of each hemisphere, while there is very little variation at the poles. While temperature at the equator varies somewhat, how much it varies stays relatively consistent.

A simpler question we can ask is:

What is the distribution of temperatures at each station?

Obviously there are too many stations to all be reasonably visualized, so our functions take in a list of stations to be plotted.

def query_temps(stations,month):
    '''
    Returns dataframe with data read from given stations at a particular month
    '''
    cmd = f"SELECT S.name,T.temp,T.year \
    FROM temperatures T \
    LEFT JOIN stations S \
    ON T.id = S.id \
    WHERE (S.name IN {tuple(stations)}) AND (T.month == {month})"
    
    return pd.read_sql_query(cmd,conn)

def hist_temps(stations, month):
    '''
    Plots histogram of temperature distribution with stations as facets
    '''
    df = query_temps(stations, month)
    df = df.rename(columns={"Temp":"Temperature Readings (\u00b0C)","NAME":"Station"})
    
    fig = px.histogram(df, 
                       x = "Temperature Readings (\u00b0C)",
                       nbins = 30,
                       facet_row = "Station",
                       title = f"Distribution of Temperature Readings (\u00b0C) in {month_dict[month]}")
    
    return fig

Here’s what it looks like:

stations = ["DUBAI_INTL","HERAT","SAVE"]
hist_temps(stations,3)

We’re pretty much done, but one last thing: don’t forget to close your connection to the database!

conn.close()